Abstract

The Vapnik-Chervonenkis (VC) dimension measures the complexity of a learning
machine, and a low VC dimension leads to good generalization. The recently
proposed Minimal Complexity Machine (MCM) learns a hyperplane classifier
by minimizing an exact bound on the VC dimension. This paper extends the
MCM classifier to the fuzzy domain. The use of a fuzzy membership is known
to reduce the effect of outliers and of noise on learning. Experimental results
show that, on a number of benchmark datasets, the fuzzy MCM classifier
outperforms SVMs and the conventional MCM in terms of generalization, and
that the fuzzy MCM uses fewer support vectors. On several benchmark datasets,
the fuzzy MCM classifier yields excellent test set accuracies while using
one-tenth the number of support vectors used by SVMs.
Keywords: Machine Learning, Support Vector Machines, VC dimension, 
complexity, generalization, fuzzy SVMs 


1. Introduction 


Support vector machines are amongst the most widely used machine learning 
techniques today. The most commonly used variants are the maximum margin 
L1 norm SVM [1], and the least squares SVM (LSSVM) [2], both of which require
the solution of a quadratic programming problem. The proximal SVM [3] is also
similar in spirit to the LSSVM. SVMs were motivated by the celebrated work 
of Vapnik and his colleagues on generalization, and the complexity of learning. 
The capacity of a learning machine may be measured by its VC dimension, and 
a small VC dimension leads to good generalization and low error rates on test 
data. 


However, according to Burges [4], SVMs can have a very large VC dimension,
and “at present there exists no theory which shows that good generalization
performance is guaranteed for SVMs”. In recent work [5], we have shown how
to learn a bounded margin hyperplane classifier, termed the Minimal
Complexity Machine (MCM), by minimizing an exact bound on its VC dimension.
Experimental results on many benchmark datasets confirm that, in comparison
to SVMs, the MCM generalizes well while using significantly fewer support
vectors, often lower by a factor between 10 and 50.

Classically, each training sample in a binary classification setting is treated
equally and is associated with a unique class. However, in reality, some training
samples may be corrupted by noise; this could be noise in the sample’s location
or in its label. Such samples may be thought of as not lying entirely in one
class, but belonging to both classes to a certain degree [8]. It is well known that
SVMs are very sensitive to outliers. Fuzzy support vector machines
(FSVM) [8] were proposed to address this problem. In FSVMs, each sample
is assigned a fuzzy membership which indicates the extent to which it belongs to
a class. The membership also determines the importance of the sample in
determining the separating hyperplane. Consequently, the measurement of the
empirical error in a fuzzy setting does not treat all samples equally. Discounting
errors on outlier samples can allow hyperplanes with larger margins to be learnt,
and can also mitigate the effect of noise to a considerable degree.

This paper extends the MCM into the fuzzy domain, by attempting to learn
a gap tolerant, or fat margin, fuzzy classifier with low VC dimension. The fuzzy
MCM objective function consists of two terms. The first term is related to the
VC dimension of the classifier, and minimization of this term yields a
classifier with good generalization properties. The second term is a weighted sum
of misclassification errors over the training samples; the weights depend on
the fuzzy memberships of the samples, so that samples that are outliers
contribute less to the overall error measure. The fuzzy MCM optimization problem
thus tries to find a hyperplane with a small VC dimension that minimizes the
fuzzy weighted empirical error over the training samples. Because the
misclassification errors of different samples are not weighted equally, the
effect of outliers is reduced, which helps improve generalization.
The fuzzy Minimal Complexity Machine, as the proposed approach is termed,
uses far fewer support vectors than conventional SVMs,
while yielding better test set accuracy. The effect of minimizing the
VC dimension may be gauged from the fact that on several datasets, the number
of support vectors is more than fifteen times smaller than that used by SVMs.
As we show in the sequel, an interesting example is that of the ‘haberman’
dataset from the UCI machine learning repository [10], which has 306 samples.
A fuzzy MCM classifier learnt using 80% of the dataset can be written as
a closed form expression involving only 4 support vectors. In
comparison, an SVM classifier uses about 73 support vectors.

The rest of the paper is organized as follows. Section 2 briefly describes the
MCM classifier, for the sake of completeness. Section 3 shows how to extend the
approach to learn a linear fuzzy MCM classifier, and Section 4 then extends this
work to the kernel case. Section 5 is devoted to a discussion of results obtained
on selected benchmark datasets. Section 6 contains concluding remarks.


2. The Linear Minimal Complexity Machine Classifier 


The motivation for the MCM originates from some outstanding work on 
generalization [ill Q, H, Q• 

Consider a binary classification dataset with n-dimensional samples
$x^i, i = 1, 2, \ldots, M$, where each sample is associated with a label
$y_i \in \{+1, -1\}$. Vapnik [13] showed that the VC dimension $\gamma$ for fat
margin hyperplane classifiers with margin $d \geq d_{min}$ satisfies

    $$ \gamma \leq 1 + \mathrm{Min}\left( \frac{R^2}{d_{min}^2}, \; n \right) \tag{1} $$

where R denotes the radius of the smallest sphere enclosing all the training
samples. Burges, in [4], stated that “the above arguments strongly suggest that
algorithms that minimize $R^2/d^2$ can be expected to give better generalization
performance. Further evidence for this is found in the following theorem of
(Vapnik, 1998), which we quote without proof”.


Following this line of argument leads us to the formulations for a hyperplane 
classifier with minimum VC dimension; we term the same as the MCM classifier. 
We now summarize the MCM classifier formulation for the sake of completeness. 
Details may be found in [5].

Consider the case of a linearly separable dataset. By definition, there exists 
a hyperplane that can classify these points with zero error. Let the separating 
hyperplane be given by 

    $$ u^T x + v = 0. \tag{2} $$


Let us denote

    $$ h = \frac{\mathrm{Max}_{i=1,2,\ldots,M} \; y_i (u^T x^i + v)}{\mathrm{Min}_{i=1,2,\ldots,M} \; y_i (u^T x^i + v)}. \tag{3} $$

In [5], we show that there exist constants $\alpha, \beta > 0$, $\alpha, \beta \in \Re$
such that

    $$ \alpha h^2 \leq \gamma \leq \beta h^2, \tag{4} $$

or, in other words, $h^2$ constitutes a tight or exact ($\Theta$) bound on the VC
dimension $\gamma$. An exact bound implies that $h^2$ and $\gamma$ are close to each other.


Figure 1 illustrates this notion. It is known that the number of degrees of
freedom in a learning machine is related to the VC dimension, but the
connection is tenuous and usually abstruse. Even though the VC dimension $\gamma$ may
have a complicated dependence on the variables defining the learning machine,
the VC dimension $\gamma$ is bounded by multiples of $h^2$ from both above and below.
The exact bound $h^2$ is thus always “close” to the VC dimension, and minimizing
$h^2$ with respect to the variables defining the learning machine allows us to find
one that has a small VC dimension. The use of a continuous and differentiable
exact bound on the VC dimension allows us to find a learning machine with
small VC dimension; this may be achieved by minimizing $h$ over the space of
variables defining the separating hyperplane. In the case of a hyperplane
classifier, the only variables are u and v, and a hyperplane classifier with a small
VC dimension is obtained by minimizing $h^2$ with respect to these variables.
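
As a small illustration of the quantity h defined in (3), the following sketch evaluates it for a given hyperplane on a toy dataset. The function name `margin_ratio` and the toy data are assumptions made for this example and are not taken from the MCM implementation.

```python
def margin_ratio(X, y, u, v):
    """Return h from (3): the largest over the smallest value of y_i (u.x_i + v).

    Raises ValueError when the hyperplane does not separate the samples,
    because h is only defined in (3) for a separating hyperplane.
    """
    scores = [
        yi * (sum(uk * xk for uk, xk in zip(u, xi)) + v)
        for xi, yi in zip(X, y)
    ]
    lowest = min(scores)
    if lowest <= 0:
        raise ValueError("hyperplane does not separate the samples")
    return max(scores) / lowest


# Toy one-dimensional dataset (hypothetical values).
X = [[-3.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]

h_centred = margin_ratio(X, y, u=[1.0], v=0.0)   # scores 3, 1, 1, 2
h_scaled = margin_ratio(X, y, u=[2.0], v=0.0)    # same hyperplane, rescaled
h_shifted = margin_ratio(X, y, u=[1.0], v=0.5)   # hyperplane moved towards class -1
```

Rescaling (u, v) leaves h unchanged, while moving the hyperplane closer to one class increases it; this is the behaviour that minimizing h exploits.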



[Figure 1 omitted; horizontal axis: space of learning machine variables]

Figure 1: Illustration of the notion of an exact bound on the VC dimension.
Even though the VC dimension $\gamma$ may have a complicated dependence on the
variables defining the learning machine, the VC dimension $\gamma$ is bounded by
multiples of $h^2$ from both above and below. The exact bound $h^2$ is thus always
“close” to the VC dimension, and minimizing $h^2$ with respect to the variables
defining the learning machine allows us to find one that has a small VC
dimension.


The MCM classifier solves an optimization problem that tries to minimize
the machine capacity, while classifying all training points of the linearly
separable dataset correctly. This problem is given by

    $$ \mathop{\mathrm{Minimize}}_{u,v} \; h = \frac{\mathrm{Max}_{i=1,2,\ldots,M} \; y_i (u^T x^i + v)}{\mathrm{Min}_{i=1,2,\ldots,M} \; y_i (u^T x^i + v)}, \tag{5} $$

that attempts to minimize $h$ instead of $h^2$, the square function $(\cdot)^2$ being a
monotonically increasing one for positive arguments.

This optimization problem is both quasiconvex and pseudoconvex. In [5],
we further show that the optimization problem (5) may be reduced to the linear
programming problem

    $$ \mathop{\mathrm{Min}}_{w,b,h} \; h \tag{6} $$
    $$ h \geq y_i \cdot [w^T x^i + b], \quad i = 1, 2, \ldots, M \tag{7} $$
    $$ y_i \cdot [w^T x^i + b] \geq 1, \quad i = 1, 2, \ldots, M, \tag{8} $$

where $w \in \Re^n$, and $b, h \in \Re$. We refer to the problem (6) - (8) as the hard
margin Linear Minimal Complexity Machine (Linear MCM).
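
One way to see why the fractional problem (5) admits a linear form is sketched below. This is a reading of the reduction under the assumption that the data are linearly separable, not a reproduction of the proof in [5]; the scaling factor $t$ is a symbol introduced here for illustration.

```latex
% h in (5) is unchanged when (u, v) is scaled by any t > 0:
h(tu, tv) = \frac{\max_i \; y_i (t u^T x^i + t v)}{\min_i \; y_i (t u^T x^i + t v)} = h(u, v).

% Choose t = 1 / \min_i y_i (u^T x^i + v) > 0 and set w = t u, b = t v. Then
\min_i \; y_i (w^T x^i + b) = 1
\quad \Longrightarrow \quad
h = \max_i \; y_i (w^T x^i + b).

% Minimizing this maximum can be written in epigraph form:
\min_{w, b, h} \; h
\quad \text{s.t.} \quad
h \geq y_i (w^T x^i + b), \quad
y_i (w^T x^i + b) \geq 1, \quad i = 1, 2, \ldots, M.
```

Relaxing the normalization from equality to the inequality in (8) loses nothing: any feasible point whose smallest value of $y_i (w^T x^i + b)$ exceeds 1 can be scaled down, which lowers $h$.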

In practice, the datasets may not be linearly separable. In such a case, we 
seek a classifier with a minimal VC dimension that has a small mis-classification 
error on the training samples. Such a hyperplane may be found by solving the 
soft margin MCM formulation, that is given by 


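    $$ \mathop{\mathrm{Min}}_{w,b,h,q} \; h + C \cdot \sum_{i=1}^{M} q_i \tag{9} $$
    $$ h \geq y_i \cdot [w^T x^i + b] + q_i, \quad i = 1, 2, \ldots, M \tag{10} $$
    $$ y_i \cdot [w^T x^i + b] + q_i \geq 1, \quad i = 1, 2, \ldots, M, \tag{11} $$
    $$ q_i \geq 0, \quad i = 1, 2, \ldots, M. \tag{12} $$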


Once w and b have been determined by solving (9) - (12), the class of a test
sample x may be determined from the sign of the discriminant function

    $$ f(x) = w^T x + b \tag{13} $$

3. The Fuzzy Minimal Complexity Machine Classifier 

In the linear soft margin MCM formulation (9) - (12), the error variables
$q_i, i = 1, 2, \ldots, M$ measure the mis-classification error on the respective data
samples, and the second term of the objective function in (9) is a weighted sum
of all the mis-classification errors. In this case, the hyper-parameter C equally
weights all variables $q_i$; this effectively means that errors made on all samples
are equally important. In reality, noise tends to corrupt training samples, and
robust learning requires us to discount outliers, by assigning reduced importance
to samples in which one has less confidence.

Some samples may not be representative of a class. For example, a person 
showing some symptoms of a disease may have characteristics that overlap with 
both healthy subjects as well as unhealthy ones. Therefore, the membership of 
the class to which a sample belongs tends to be fuzzy, with a fuzzy membership 
(a value between 0 and 1) indicating the extent to which the sample may be 
said to belong to one class or the other. Samples with a higher membership 
value can be thought of as more representative of that class, while those with 
a smaller membership value should be given less importance when building a 
classifier. 

Consider the samples in Fig. 2. Four outlier samples have been highlighted
by arrows. The hyperplanes found before and after discounting outlier samples 
have been shown in (a) and (b), respectively. Two of the outliers are in black, 
and discounting them would allow us to obtain a hyperplane with a smaller VC 
dimension. Two of the outliers are marked in red, and these make the data set 
linearly non-separable. Discounting classification errors on these red coloured 
samples would allow for a more robust classifier to be learnt. 


Lin and Wang proposed fuzzy SVMs in [8], wherein they suggested that
each sample be associated with a fuzzy membership $s_i$. This membership value
determines how important it is to classify a data sample correctly; samples
with lower values of the membership function are less representative of the class
to which they have been assigned, and can therefore be mis-classified without
incurring the same penalty. In the example of Fig. 2, outlier samples would
have a small membership value; the optimization problem being solved factors
in these membership values, thus allowing a more robust classifier to be learnt.

Figure 2: Discounting classification errors on outliers may allow a classifier with
a smaller VC dimension to be learnt. Outliers that contribute to large
classification errors may often not be representative of the class that labels
indicate. Discounting classification errors on such samples allows for more
robust learning. The use of fuzzy memberships provides for a natural way to
measure how important it is to correctly classify a given sample.

The fuzzy MCM classifier aims to learn a hyperplane classifier that has a
small VC dimension, and that also minimizes a weighted measure of the
classification error on training samples. The linear fuzzy MCM (FMCM) classifier
does this by solving the following optimization problem.


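    $$ \mathop{\mathrm{Min}}_{w,b,h,q} \; h + C \cdot \sum_{i=1}^{M} s_i q_i \tag{14} $$
    $$ h \geq y_i \cdot [w^T x^i + b] + q_i, \quad i = 1, 2, \ldots, M \tag{15} $$
    $$ y_i \cdot [w^T x^i + b] + q_i \geq 1, \quad i = 1, 2, \ldots, M, \tag{16} $$
    $$ q_i \geq 0, \quad i = 1, 2, \ldots, M. \tag{17} $$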


Here, the fuzzy membership $s_i$ for the i-th sample is used to determine the
importance of the sample in terms of its possible mis-classification. This implies
that samples with a small value of the fuzzy membership, such as outliers, can
be ignored or accorded less importance when learning the classifier. This makes
the classifier less sensitive to outliers, leading to more robust learning. In the
example of Fig. 2, the values of $s_i$ for the outlier samples are small. This implies
that the objective function (14) discounts the errors incurred on such
samples, because of the small values of the weights $s_i$.
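
The linear fuzzy MCM in (14) - (17) is a linear program, and the sketch below shows one way its data could be assembled in the standard form "minimize c.z subject to A z <= b_ub" with variable bounds, for z = [w, b, h, q]. The function names, the toy data and the candidate hyperplane are assumptions made for illustration; no solver is called, although the returned tuple has the shape that LP routines such as `scipy.optimize.linprog` accept.

```python
def build_fmcm_lp(X, y, s, C):
    """Assemble (14)-(17) as min c.z s.t. A z <= b_ub, with z = [w, b, h, q]."""
    M, n = len(X), len(X[0])
    num_vars = n + 2 + M
    # Objective (14): h + C * sum_i s_i q_i; w and b carry zero cost.
    c = [0.0] * (n + 1) + [1.0] + [C * si for si in s]
    A, b_ub = [], []
    for i in range(M):
        upper = [0.0] * num_vars   # (15): y_i (w.x_i + b) + q_i - h <= 0
        lower = [0.0] * num_vars   # (16): -y_i (w.x_i + b) - q_i <= -1
        for k in range(n):
            upper[k] = y[i] * X[i][k]
            lower[k] = -y[i] * X[i][k]
        upper[n], lower[n] = y[i], -y[i]
        upper[n + 1] = -1.0
        upper[n + 2 + i] = 1.0
        lower[n + 2 + i] = -1.0
        A.extend([upper, lower])
        b_ub.extend([0.0, -1.0])
    # (17): q_i >= 0; w, b and h are left free.
    bounds = [(None, None)] * (n + 2) + [(0.0, None)] * M
    return c, A, b_ub, bounds


def candidate_point(X, y, w, b):
    """Cheapest feasible h and q for a fixed hyperplane (w, b)."""
    margins = [
        yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]
    q = [max(0.0, 1.0 - m) for m in margins]
    h = max(m + qi for m, qi in zip(margins, q))
    return list(w) + [b, h] + q


# Toy data (hypothetical): the last sample sits deep inside the other class.
X = [[-2.0], [-1.0], [1.0], [2.0], [1.5]]
y = [-1, -1, 1, 1, -1]
s = [1.0, 1.0, 1.0, 1.0, 0.1]          # small membership for the outlier

c, A, b_ub, bounds = build_fmcm_lp(X, y, s, C=1.0)
z = candidate_point(X, y, w=[1.0], b=0.0)
objective = sum(ck * zk for ck, zk in zip(c, z))
feasible = all(
    sum(a * zk for a, zk in zip(row, z)) <= rhs + 1e-9
    for row, rhs in zip(A, b_ub)
)
```

With the outlier's membership set to 0.1, its error of 2.5 adds only 0.25 to the objective; with a membership of 1 it would add the full 2.5, which is the discounting effect described above.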

In the following section, we show how the linear fuzzy MCM can be extended 
to the kernel case. 

4. The Fuzzy Kernel MCM 

We consider a map $\phi(\cdot)$ that maps the input samples from $\Re^n$ to $\Re^r$, where
$r > n$. The separating hyperplane in the image space is given by

    $$ u^T \phi(x) + v = 0. \tag{18} $$

Following (14) - (17), the optimization problem for the fuzzy kernel MCM
may be shown to be

    $$ \mathop{\mathrm{Min}}_{w,b,h,q} \; h + C \cdot \sum_{i=1}^{M} s_i q_i \tag{19} $$
    $$ h \geq y_i \cdot [w^T \phi(x^i) + b] + q_i, \quad i = 1, 2, \ldots, M \tag{20} $$
    $$ y_i \cdot [w^T \phi(x^i) + b] + q_i \geq 1, \quad i = 1, 2, \ldots, M \tag{21} $$
    $$ q_i \geq 0, \quad i = 1, 2, \ldots, M. \tag{22} $$


The image vectors $\phi(x^i), i = 1, 2, \ldots, M$ form an overcomplete basis in the
empirical feature space, in which $w$ also lies. Hence, we can write

    $$ w = \sum_{j=1}^{M} \lambda_j \phi(x^j). \tag{23} $$

Note that in (23), the $\phi(x^j)$'s for which the corresponding $\lambda_j$'s are non-zero
may be termed as support vectors.

Therefore,

    $$ w^T \phi(x^i) + b = \sum_{j=1}^{M} \lambda_j \phi(x^j)^T \phi(x^i) + b = \sum_{j=1}^{M} \lambda_j K(x^i, x^j) + b, \tag{24} $$

where $K(p, q)$ denotes the Kernel function with input vectors p and q, and
is defined as

    $$ K(p, q) = \phi(p)^T \phi(q). \tag{25} $$
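
The identity in (24) can be checked numerically when the feature map is known explicitly. The sketch below uses the homogeneous quadratic kernel in two dimensions, chosen purely as an assumption for illustration because its feature map is finite (the experiments reported later use a Gaussian kernel); the sample values and multipliers are hypothetical.

```python
import math


def phi(p):
    """Explicit feature map of the 2-D quadratic kernel (illustrative choice)."""
    return [p[0] ** 2, math.sqrt(2.0) * p[0] * p[1], p[1] ** 2]


def kernel(p, q):
    """K(p, q) = (p . q)^2, which equals phi(p) . phi(q) as in (25)."""
    return (p[0] * q[0] + p[1] * q[1]) ** 2


def f_explicit(x, samples, lambdas, b):
    """Left side of (24): form w as in (23), then evaluate w . phi(x) + b."""
    w = [0.0, 0.0, 0.0]
    for lam, xj in zip(lambdas, samples):
        w = [wk + lam * fk for wk, fk in zip(w, phi(xj))]
    return sum(wk * fk for wk, fk in zip(w, phi(x))) + b


def f_kernel(x, samples, lambdas, b):
    """Right side of (24): kernel evaluations only; w is never formed."""
    return sum(lam * kernel(x, xj) for lam, xj in zip(lambdas, samples)) + b


samples = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]]
lambdas = [0.7, 0.0, -1.2]      # the zero multiplier marks a non-support vector
b = 0.1
x = [1.5, -0.4]

support_vectors = [xj for lam, xj in zip(lambdas, samples) if lam != 0.0]
gap = abs(f_explicit(x, samples, lambdas, b) - f_kernel(x, samples, lambdas, b))
```

Samples with a zero multiplier drop out of the kernel sum entirely, which is why a sparse set of multipliers gives a compact classifier.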


Substituting from (24) into (19) - (22), we obtain the following optimization
problem.


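    $$ \mathop{\mathrm{Min}}_{\lambda,b,h,q} \; h + C \cdot \sum_{i=1}^{M} s_i q_i \tag{26} $$
    $$ h \geq y_i \cdot \Big[ \sum_{j=1}^{M} \lambda_j K(x^i, x^j) + b \Big] + q_i, \quad i = 1, 2, \ldots, M \tag{27} $$
    $$ y_i \cdot \Big[ \sum_{j=1}^{M} \lambda_j K(x^i, x^j) + b \Big] + q_i \geq 1, \quad i = 1, 2, \ldots, M \tag{28} $$
    $$ q_i \geq 0, \quad i = 1, 2, \ldots, M. \tag{29} $$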

Once the variables $\lambda_j, j = 1, 2, \ldots, M$ and $b$ are obtained, the class of a test
point x can be determined by evaluating the sign of

    $$ f(x) = w^T \phi(x) + b = \sum_{j=1}^{M} \lambda_j K(x, x^j) + b. \tag{30} $$

Results on benchmark datasets indicate that the use of fuzzy memberships in 
the FMCM can reduce the number of support vectors and also lead to improved 
accuracies on test data. In the sequel, we present results on the linear and kernel 
versions of the fuzzy MCM. 


5. Experimental results 


The FMCM was coded in MATLAB. The code is available on request from
the author. Fuzzy membership values were computed by using the approach
outlined in [8]. In this case, the membership value $s_i$ of the i-th sample is
a function of its distance from its class centre. Lin and Wang suggested the
formula

    $$ s_i = \begin{cases} 1 - \dfrac{\| x_+ - x^i \|}{r_+ + \delta}, & \text{if } y_i = 1, \text{ i.e., the sample belongs to class 1} \\[2ex] 1 - \dfrac{\| x_- - x^i \|}{r_- + \delta}, & \text{if } y_i = -1, \text{ i.e., the sample belongs to class -1} \end{cases} \tag{31} $$

Here, $r_+$ and $r_-$ are the radii of the two classes, and $x_+$ and $x_-$ are the respective
class centres. The scalar $\delta$ is a small positive number used to ensure that $s_i$
does not become zero. Figure 3 illustrates the computation of the fuzzy membership.
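
A minimal sketch of the membership computation in (31) follows. It assumes that the class centre is the mean of the class and that the class radius is the largest distance from the centre to a sample of that class; the text above does not spell out these definitions, so both are labelled assumptions, as are the function names and the toy data.

```python
import math


def class_centre(points):
    """Mean of the samples in one class (assumed definition of the centre)."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fuzzy_memberships(X, y, delta=1e-3):
    """Membership s_i of every sample following the form of (31)."""
    stats = {}
    for label in (1, -1):
        members = [x for x, yi in zip(X, y) if yi == label]
        centre = class_centre(members)
        # Assumed definition of the radius: farthest member from the centre.
        radius = max(distance(centre, x) for x in members)
        stats[label] = (centre, radius)
    memberships = []
    for x, yi in zip(X, y):
        centre, radius = stats[yi]
        memberships.append(1.0 - distance(centre, x) / (radius + delta))
    return memberships


# Toy data (hypothetical): the fourth sample lies far from the rest of class +1.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0],
     [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
y = [1, 1, 1, 1, -1, -1, -1]
s = fuzzy_memberships(X, y)
```

The sample farthest from its class centre receives the smallest membership in its class, and delta keeps that value strictly positive.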

In order to evaluate the FMCM, we chose a number of benchmark datasets
from the UCI machine learning repository [10]. Table 1 summarizes information
about the number of samples and features of each dataset.

Table 2 summarizes five fold cross validation results of the fuzzy linear MCM
on a number of datasets taken from the UCI machine learning repository.
Accuracies refer to the test sets, and are indicated as mean ± standard deviation
over the five folds. The table
compares the linear fuzzy MCM with LIBSVM using a linear kernel. The values of C
were determined for the FMCM by performing a grid search.


Figure 3: The figure illustrates the computation of fuzzy memberships. The 
fuzzy membership of a sample depends on its distance from its class centre, as 
well as the radius of the corresponding cluster. 



Table 1: Characteristics of the Benchmark Datasets used

dataset               Size (samples x features)
fertility diagnosis   100 x 9
promoters             106 x 57
echocardiogram        132 x 12
hepatitis             155 x 19
plrx                  182 x 12
heart statlog         270 x 13
horsecolic            300 x 27
haberman              306 x 3
australian            690 x 14
crx                   690 x 15
transfusion           748 x 5

Table 2: Linear Fuzzy MCM: Test Set Accuracies

datasets               Linear Fuzzy MCM   Linear Fuzzy SVM
haberman               74.47 ± 3.58       73.87 ± 3.06
transfusion            76.19 ± 4.12       76.32 ± 4.12
echocardiogram         88.63 ± 2.44       84.84 ± 5.40
plrx                   71.83 ± 7.49       71.42 ± 7.37
crx                    70.00 ± 3.13       68.55 ± 2.45
horsecolic             81.00 ± 4.03       80.00 ± 4.35
australian             85.81 ± 2.01       85.36 ± 1.55
fertility diagnosis    88.00 ± 9.27       86.00 ± 9.01
hepatitis              67.09 ± 5.55       60.64 ± 7.19
pima indian diabetes   77.33 ± 5.54       74.84 ± 6.62
promoters              74.08 ± 10.88      69.83 ± 11.52
mammographic masses    83.48 ± 5.13       79.73 ± 5.45
voting                 94.94 ± 0.92       94.71 ± 1.17
heart statlog          85.18 ± 2.62       83.70 ± 2.72
breast                 96.83 ± 1.19       96.48 ± 1.24
bands                  73.86 ± 4.13       72.55 ± 5.00

Table 3 summarizes five fold cross validation results of the fuzzy kernel MCM
on a number of datasets. A Gaussian kernel was used for both the FMCM and
the FSVM. The width of the Gaussian kernel was chosen by using a grid search.

Table 3: Kernel Fuzzy MCM results

                       Test Set Accuracy                  # Support Vectors
datasets               Fuzzy Kernel MCM   Fuzzy SVM       Fuzzy Kernel MCM   Fuzzy SVM
haberman               74.82 ± 3.72       72.86 ± 3.20    7.80 ± 6.65        138.20 ± 2.93
transfusion            79.27 ± 4.20       77.80 ± 4.05    22.80 ± 11.34      299.06 ± 10.46
echocardiogram         88.57 ± 6.14       87.13 ± 6.49    24.80 ± 6.14       48.00 ± 3.29
plrx                   71.41 ± 6.75       71.42 ± 7.37    7.00 ± 5.34        116.20 ± 5.49
crx                    71.01 ± 1.89       68.84 ± 2.79    92.80 ± 53.95      404.40 ± 7.74
horsecolic             81.00 ± 3.27       79.66 ± 4.64    35.40 ± 15.91      187.20 ± 2.93
australian             85.50 ± 1.72       86.08 ± 1.61    107.60 ± 5.92      244.80 ± 4.12
fertility diagnosis    91.00 ± 8.60       88.00 ± 9.27    17.70 ± 9.09       39.00 ± 6.23
hepatitis              69.03 ± 8.80       62.57 ± 8.31    44.00 ± 39.94      104.20 ± 2.64
pima indian diabetes   76.55 ± 3.05       76.81 ± 3.31    112.80 ± 75.17     355.40 ± 7.45
promoters              79.41 ± 3.56       76.46 ± 5.83    78.40 ± 10.71      84.80 ± 0.40
mammographic masses    82.01 ± 4.50       82.12 ± 3.92    61.80 ± 9.41       332.00 ± 14.46


A comparison with the fuzzy SVM indicates that the fuzzy MCM yields
comparable or better generalization with fewer support vectors. In Table 3, the
fuzzy kernel MCM shows a higher test set accuracy on eight of the twelve
datasets, and uses fewer support vectors on all twelve. It is also interesting
to note that the
fuzzy MCM outperforms the classical MCM in terms of the number of support
vectors and test set accuracies. The results of the classical MCM have not
been duplicated from [5] for the sake of brevity; an added reason is that a fair
comparison would be between two methods that use a fuzzy methodology.

As an interesting illustration of the sparsity of the fuzzy MCM, consider a
fuzzy kernel MCM classifier, employing a Gaussian kernel, that is learnt using
a randomly chosen subset comprising 80% of the samples of the ’haberman’
dataset. This classifier may be tested by the reader on any randomly chosen
set of samples. It is interesting because it uses only four support vectors
and can be written down as the following closed form expression.

    $$ \begin{aligned} f(x_1, x_2, x_3) = \mathrm{sign}\{ & -105.8063 \exp[-10^{-4} ((x_1 - 36)^2 + (x_2 - 69)^2 + x_3^2)] \\ & + 90.5143 \exp[-10^{-4} ((x_1 - 43)^2 + (x_2 - 58)^2 + (x_3 - 52)^2)] \\ & + 129.7232 \exp[-10^{-4} ((x_1 - 54)^2 + (x_2 - 67)^2 + (x_3 - 46)^2)] \\ & - 113.7966 \exp[-10^{-4} ((x_1 - 62)^2 + (x_2 - 58)^2 + x_3^2)] \\ & - 0.7661 \} \end{aligned} \tag{32} $$

Here, the input samples are in three dimensions, and given by $(x_1, x_2, x_3)$.
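
For readers who wish to try the classifier, expression (32) can be transcribed directly, as in the sketch below. The coefficients, centres and kernel width are copied from (32); reading the bare $x_3^2$ terms as centres with a third coordinate of 0, returning +1 when the discriminant is exactly zero, and all identifier names are assumptions of this sketch.

```python
import math

# (coefficient, support vector) pairs transcribed from (32).
SUPPORT_TERMS = [
    (-105.8063, (36.0, 69.0, 0.0)),
    (90.5143, (43.0, 58.0, 52.0)),
    (129.7232, (54.0, 67.0, 46.0)),
    (-113.7966, (62.0, 58.0, 0.0)),
]
BIAS = -0.7661
WIDTH = 1e-4   # multiplier of the squared distance inside each exponential


def discriminant(x):
    """Value of the expression inside sign{...} in (32) for x = (x1, x2, x3)."""
    total = BIAS
    for coefficient, centre in SUPPORT_TERMS:
        squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
        total += coefficient * math.exp(-WIDTH * squared_distance)
    return total


def classify(x):
    """Predicted label in {+1, -1}; a zero discriminant is mapped to +1 here."""
    return 1 if discriminant(x) >= 0 else -1


# Evaluate at two of the support vectors themselves.
label_a = classify((36.0, 69.0, 0.0))
label_b = classify((54.0, 67.0, 46.0))
```

Evaluating the classifier on a sample costs four exponentials, regardless of the number of training samples.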

6. Conclusion 

In this paper, we propose a way to build a fuzzy hyperplane classifier, termed
the fuzzy Minimal Complexity Machine (MCM), that learns a fuzzy classifier
with small VC dimension. The fuzzy MCM involves the solution of a linear
programming problem. Experimental results show that the fuzzy MCM
outperforms the fuzzy SVM in terms of test set accuracies on a number of selected
benchmark datasets. At the same time, the number of support vectors is smaller,
often by a factor of 10 or more. It has not escaped
our attention that the proposed approach can be extended to fuzzy least squares
classifiers, as well as to tasks such as fuzzy regression and fuzzy time series
prediction; in fact, a large number of variants of fuzzy SVMs can be re-examined
from the perspective of the fuzzy MCM.